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Abstract 

We describe a set of new estimators for the N-point correlation functions of 
point processes. The variance of these estimators is calculated for the Poisson and 
binomial cases. It is shown that the variance of the unbiased estimator converges 
to the continuum value much faster than with any previously used alternative, 
all terms with slower convergence exactly cancel. We compare our estimators 
with Ripley's Kq and K^- 
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1. Introduction 



Estimation of correlation functions from a set of points is a classical problem of 
spatial statistics. The two-point correlation functions are the most widely used, but 
there is an increasing interest in estimating higher order correlation functions as 
well. A new class of estimators was introduced in astrophysics |nj which pertains 
to most methods currently applied to data sets of galaxy positions. We present the 
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rigorous calculation of the variances of these new estimators for Poisson and binomial 
point processes. It will be shown in the next sections that for these processes the 
estimators have smaller variance than any estimator previously used for the same 
statistics. While the results are completely general, they were motivated by future 
astrophysical applications, therefore all the examples will be taken from there. Higher 
order correlation functions from galaxy catalogs are routinely estimated since the 
70's. With a new generation of galaxy catalogs coming on line in the next five years, 
understanding the estimators is an important and timely problem. 

A unique feature of spatial statistics is that errors of a measurement are often 
dominated by geometrical terms, like edge effects [§. Thus ever since correlation 
functions were estimated, corrections for edge effects have played a central role. The 
estimators for correlation functions in the astrophysical literature is reviewed in |]llf , 
here we only quote a few selected additional references || [l], [|, [|. It is generally 
accepted that the most efficient estimator for the two-point function is that of Q , or 
its relative Q. 

The new estimator can be approximated in the continuum limit (achieved when 
the number density of data points — > oo) as W2 — (6162), where () denotes ensemble 
average, and S is the fluctuation of the continuous (galaxy) density field p. It is 
defined as p — (p) (1 + S), thus (S) = 0. The Monte Carlo representation of this 
estimator is often written symbolically as (DD — 2DR + RR)/RR, with DD, DR, 
and RR representing the respective pair counts. The important point is that the above 
estimator contains only the most necessary terms in the continuous limit, while all 
others, such as DD/RR - 1 -> ((1 + 6i)(l + S 2 )) - 1 = (M2 + <5i + h), have extra 
terms. These extra terms do not affect the ensemble average of the unbiased estimator, 
but increase the variance. The deceptively simple look of the second estimator in 
terms of D and R was the reason for its popularity. 

Ripley || has discussed extensively the variance of second order estimators for 
Poisson and binomial point processes. He has shown, that the C(n _1 )term in the 
variance of the simple estimator, K , is proportional to u, the perimeter for a two 
dimensional domain. This implies that the effect is due to inadequate edge corrections, 
in agreement with Hewett's fl suggestion. The subtraction of the appropriate DR 
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terms is equivalent to an edge correction. 

The effect of the extra terms on the variance is even more pronounced for the higher 
order functions, since there will be a lot more terms arising through various combi- 
natorial expressions. We proposed intuitively jflj that the obvious generalization for 
higher order correlations is to create higher order equivalents of the estimator Q. 
With Si ~ (Di — Ri)/Ri this corresponds to (5\...5n)- In symbolic notation, this 
estimator can be written as 

(1) w N = {D 1 - Rx).(D 2 - R2) ■ ■ ■ (D N - Rn)/Ri ■■■Rn- 

This corresponds to the Monte-Carlo approximation of the continuum limit of the 
TV-point correlation function of fluctuations; the exact meaning is discussed later. 
The most significant result is that the correlations of the fluctuations automatically 
correct for edge effects for Poisson and binomial point processes. 

Different approaches to edge effects exist; for a review see jlO|, It remains to 
be seen whether these geometrically motivated estimators, or their generalization for 
higher order § fare better than the Q estimator, or the related estimator introduced 

by !■ 

The main goal of this article is to rigorously derive the variance of the estimator for 
Poisson and binomial processes. The next section presents the analytic calculation 
of the variance for arbitrary N. Section 3 compares the second order estimator with 
that of the related Ripley's K function. The last section summarizes the results. 



2. Variance of the Edge Corrected Estimators in the Poisson and Binomial 
Point Processes 

Many interesting statistics, such as the TV-point correlation functions and their 
Fourier analogs, can be formulated as functions over TV points from the catalog. The 
covariance of a pair of such estimators will be calculated for Poisson and binomial 
point processes. They correspond to the cases, where the number of detected objects 
is varied or fixed a priori. The general case, where correlations are non-negligible is 
left for future work. The following calculations heavily rely on the elegant formalism 
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outlined in Ripley ||, which can be consulted for details. 

Let D be a catalog of data points to be analyzed, and R randomly generated over 
the same area, with averages A, and p respectively. The role of R is to perform a 
Monte Carlo integration compensating for edge effects, therefore eventually the limit 
p — » oo will be taken. A on the other hand is assumed to be externally estimated 
with arbitrary precision. We also assume that the correlations in the point process 
are weak, i.e. we operate in the Poisson limit. 

Let us define symbolically an estimator D p R q , with p + q = N for a function i> 
symmetric in its arguments 

(2) DPRi = J2^u---,x p , yi ,...,y g ), 

with Xi ^ Xj G D,yi ^= yj G R. For example for the two point correlation function 
^(x,y) — [x,y € D,r < d(x,y) < r + dr], where d(x,y) is the distance between the 
two points, and [condition] equals 1 when condition holds, otherwise. Ensemble 
averages can be estimated via factorial moment measures, v s In the Poisson 

limit v s = X s p s , where p s is the s dimensional Lebesgue measure. 
The general covariance of a pair of estimators is 

(3) {D^RfD^Kf) = J2 (7) (^)j! W P1+P2 -> 91+92 ~ J , 
with pi + qi = p2 + </2 = N, and 



(4) S k = J $a(xi ■ ■ -x k ,yk+i ■ ■ -yN)$b(xi ■ ■ -x k ,z k+ i . . .Z N )p 2 N-k- 

Throughout the paper we use the convention that (j) is nonzero only for k > 0, 1 > 0, 
and k > I. Here $ a and <&& denote two different functions, for instance corresponding 
to two radial bins. The expression simply describes the fact that out of the p\ and p2 
different data points in D we have an i-fold degeneracy, as well as a j-fold degeneracy 
in the random points drawn from R. For each of these configurations the geometric 
phase-space Si+j is different, and we sum their contributions. The dependence of Sk 
on a, 6, and N is not noted for convenience, but they will be assumed throughout the 
paper. An estimator for the generalized iV-point correlation function is 

1 wn, 



(5) ^ = _^^J(_ r -. ( _).(_) 



N -point Estimators 5 

where S — J §pn (without subscript). This definition can be expressed as (D — R) N , 
where * means normalization with A, p respectively. In this symbolic Nth. power, 
each factor is evaluated at a different point. Simple calculation in the limit of zero 
correlations yields (u>jv) = 0. 

Theorem 1 The asymptotic covariance between two estimators of the above form 
for a Poisson point process in the limit of p — > oo is 

(6) (co)Var A Wat = {w a . N w b , N ) = ■ 

Proof. According to Eq. ^ the covariance can be written as 

(7) 

> - e o o («) co«c j ") r j 

In the interesting limit, where p — » oo only j ' = survives. Changing the order of 
summation yields 



(8) (Wa.JVW^Jv) = ^ ^2 S * X lil fNi> 

i 

with 



0) fm = j:( N )( j M-l) N -' 



3 



3 / V 



This latter can be identified as the coefficients of ^2 N (xy) N , therefore /jyj = 
This in turn proves the theorem noting that (wn) = 0. This formula represents both 
variance and covariance depending on whether in the definition of Sn the implicit 
indices a and b are equal or not. 

While in the Poisson model the total number of points in the domain can vary, it 
is fixed in the binomial model. This latter case corresponds to surveys, that detect a 
certain number of galaxies, and use that to estimate the mean density as well. In a 
sense, this would be the conditional estimator of the correlations given the number of 
galaxies. The normalization of the estimator changes slightly: A 1 — ► (n)i/v l , where n 
is the total number of objects in the survey, and (n)i = n(n — 1) . . . (n — i + 1) is the 
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i-th falling factorial. This renders the definition of the estimator for binomial process 
as 

1 V- f N \, ,N-i(Dvy (Rv) N ~ i 



do) ^=^e ; h 



(n)i (n r )jv-» 



where n r is the number of points in the auxiliary random process R. 

For a binomial process the factorial moment measure is (nJjvf^fiAr, with v, the 
volume of the survey. This fact enables the proof of the following theorem. 

Theorem 2 The asymptotic covariance of two estimators for a binomial point pro- 
cess in the limit of n r — > oo is 

(11) (co)Var n ^ = i^) * (-)""'. 

Proof. First it is convenient to prove the following lemma 
Lemma 1 For all possible integer values of N, n, and i 

<»> E(r-X--X:*V" -<->"". 

tl,»2 

where the summation is over all possible values of i\ and 

Proof. It follows by induction over N . For N = i it is true, since N = i = i\ = i2 
are the only possible values. Thus 

w> (T)(v)(o)<->*=<->° 

for any n and i. Assume it is true for a particular N for any n and i. Then for A + 1 

»1i»2 

By introducing m = n — 1, fc = i — 1, fci = ii — 1, and fc 2 = «2 — 1 this reads 



<-) £(r3(r^)(^ )<-»*■* -<-> 

fci,fc 2 

which is true by induction. 



N-k 
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Now to prove the theorem consider the equation for the covariance using the 
appropriate factorial moment measure for binomial point process. After n r — > oo 

(16) 

1 a ! -/ JV V^MM( ii ).i+. 2 -(-) ,1+ ' 2 



Applying the lemma for each i separately, the theorem is proven. 

s CO CO ( ~ )n+42 (n " {n ~ i2)n - 12 



(17) N\(n) N \ N )(-) N -> 



For N — 2 the theorem coincides with B, taking into account that S 2 — So — 
(S 2 v 2 ) 2 . 

3. Discussions 

For practical applications the function $ has to be specified. For instance, $ = 1 
when the A-tuplet satisfies a certain geometry (with a suitable bin width), and 
otherwise yields the total (or disconnected) TV-point correlations of the fluctuations 
of the process. See [[ll] for detailed discussion of possible choices for the function $ 
to render popular statistics for the distribution of galaxies, such as power spectra, 
cumulant correlators, etc. Here we concentrate on the comparison with other second 
order estimators. 

The number of neighbors from a point within a distance of < t is defined as XK(t) 
U . A family of estimators denoted by Ki was introduced || with subtle differences in 
edge correction. It was found that Ripley's K 2 has the smallest variance of all. Similar 
conclusions were reached in a numerical setting motivated by potential astrophysical 
applications R|. The difference between these estimators and ours is twofold: first, 
the normalization is different, second, they estimate the moments of the full point 
process while wn deals with the moments of the fluctuations. The former point is 
trivial, while the latter is crucial, as shown later. When Wz is used to extract the 
(differential) two-point correlation function, the connection with Ripley's cumulative 
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K can be expressed as K(t) = A J Q * dr(l + u?2(r)). Note that in astrophysics the 
prevailing choice is w, and K for the most part would be estimated for the purpose 
of eventually obtaining w from it. Nevertheless, in other disciplines, or perhaps even 
for certain aspects of astrophysics, K could be more advantageous. 

Next we show that W2 has smaller variance than any of the Ki . For a pair-estimator 
T = ^2 x -iy$(x,y) the variance for a Poisson process can be expressed in a quite 
general fashion. Using the notation of the previous section, the variance is Vav\T = 
4A 3 5i + 2A 2 S , 2 H . Ripley has derived an approximation of this formula for the "naive" 



estimator Ko(t) — aT/n 2 , Va,r\K (t) = 



1.34A^ 



where a is the 



area of the two-dimensional domain, and u is its perimeter. 

In order to compare the variance W2 and T (Kq up to normalization), we scale T 
by 5A 2 



(18) 




25' 



2 



5 2 A 2 ' 
25*2 45j 
5 2 A 2 5 2 A' 



The Poisson terms in the number of pairs, ~ A~ 2 , are identical, while the 0(1/ X) 
terms are missing from the variance of u>2- The latter can be appreciable when A is 
small for the estimator Ko based on T. 

The most clever estimators of K, such as Ripley's K 2 , suppress this term consider- 
ably, nevertheless it is always present. E.g., the variance of K 2 for a Poisson process 
is @ Var A # 2 (t) = £ + 0.96^ + 0.13As£] , 

In general for any N, all contributions (D(X~ k ), k < N, exactly cancel for wn. Since 
terms higher than 0(X~ N ) are absent, the asymptotic behavior of our estimator is 
optimal: the only possible improvement for Poisson (or binomial) processes is perhaps 
to decrease the multiplicative factor. 

In fact K 2 suppresses the A -1 term at the expense of boosting the coefficient of A~ 2 
compared to Kq. In contrast, for W2 no such boost is present according to Equation [l8], 
the coefficients of the Poisson term are identical to that of the Kg estimator. This 
suggests that the estimator wn is probably close to optimal for Poisson (and binomial) 
processes. We conjecture that this is approximately true for many correlated point 
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processes as well, although for extreme cases, such a line segment process, a small bias 
was noted by the numerical investigations of for both W2 , and for other estimators 
related to the K function. 

The intuitive meaning of the results for binomial process is clear from Theorem 
2.: the variance of the estimators is oc (") , i.e. Poisson in the number of N- 
tuplets, with multiplicative factors depending on the available geometric phase space. 
For the Poisson process, Theorem 1. is identical up to discreteness effects: it is 
inversely proportional to the probability density of finding TV-points in the domain, 
oc (A^)- 1 ~ (n/v)- N . 

Our approach to suppressing the offending non-Poisson terms involves an auxiliary 
Poisson process (i?), for which the average density is assumed to be infinity p — » oo. 
The limit is not possible in practice, therefore it is desirable to evaluate the speed of 
convergence. A calculation analogous to the proof of Theorem 1. shows that 

(19) (co)Var A ,> JV = §^ jl + ± + 0(A/p) 2 | . 

This generalization of Theorem 1. clearly shows that for a finite auxiliary Poisson 
process R, there is indeed a term with 1/A, but suppressed by a large factor. For 
instance, the relative non-Poisson correction is A/(2p) for N — 2, giving less than a 
percent contamination when the auxiliary process is more than fifty times the original 
process to be measured. This is fairly convenient, since, unlike for the original process, 
we have full control over the artificially introduced process. Note that the above 
considerations assume exact knowledge of the average count A, thus no conditioning 
on the number of points is assumed. 

For a binomial process the argument is exactly analogous to the previous one, 
therefore we only outline it briefly. The variance for a general estimator T is fl 
Var„T = An(n - l)(n - 2)a 3 ,5*i - n(n - l)(4n - 6)a" 4 5 2 + 2n(n - l)a,- 2 S 2 . This is to 
be compared with Theorem 2. Var n u>2 = 2[Sq — 2S\a + S^a 2 ]/ S 2 n(n — 1) (replacing 
v with a for two dimensions). Again T(n(n — l))~ 2 a 4 has to be considered because of 
the normalization. The first two terms in the previous equation will yield n _1 terms, 
while the variance of wi is again Poisson in terms of the number of pairs. Finally, the 
convergence properties are expected to be similar to the Poisson process, although 
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we have not performed the calculations. 

It is worth to emphasize again that, while both are two-point measures, w and K 
are slightly different objects. In astrophysics w is the desired quantity, but in some 
cases K might be more advantageous; then Ripley's Ki is still the preferred estimator. 
Clearly, these findings remain true for N > 2 as well. 



4. Summary 

In summary we have calculated the variance of a new class of estimators wn for the 
A-point correlation functions for Poisson and binomial point processes. The results 
were compared with variances concerning a different class of estimators based on 
the K function. The main difference is that wn estimates the TV-point correlation 
function of the fluctuations of the point process, while Ki estimate the (cumulative) 
moments of the full point process. This property apparently renders the edge effect 
correction in wpj exact, leaving only the Poissonian contribution to the variance in 
terms of A-tuplets, i.e. terms oc 1/X N , or 1/(jv) ^ or P°i sson an d binomial processes, 
respectively. All lower order terms, dominating for sparse processes, exactly cancel. 
This is not true for any of the estimators for K, although the best ones achieve a 
significant suppression of the offending non-Poisson terms at the price of boosting the 
constant factor multiplying the Poisson terms with respect to K$, the naive estimator. 
The speed of convergence was calculated for the new estimator wn, which assumed an 
auxiliary Poisson process R, with the average p — > oo. The leading order non-Poisson 
term in the variance was found to be suppressed by a large factor X/pN, thus the 
convergence is controllable in practice when p is finite. 
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